R-Ladies Tbilisi

Predicting Election Outcomes

David Sichinava, Ph.D., CRRC-Georgia / Tbilisi State University
June 15, 2017

Today's plan:

  • Brief refresher;
  • Install necessary libraries;
  • Read polling data for 2012 and 2016 elections;
  • Do forecasts based on polling data;

The art and the science of election forecasting:


The Art

  • Treating polls as mini-elections in order to understand how voters think of the candidates / parties;
  • Electoral system;

The science


Warning, graphic footage!

The science

  • The model
  • The data

The model

  • Depends on the electoral system;
  • Could be VERY SOPHISTICATED;

The model

  • Averages (arithmetic, moving, weighted…);
  • Time series (ARMA, ARIMA, etc.);
  • Multilevel regression with poststratification (MRP, “Mr. P”);
  • Bayesian models

The data

What will we do today:

  • We will experiment with simple forecasting methods:
    • Look at polling averages;
    • Look at some time series models;
    • Do some data processing and visualization;

The data (now on Georgia)

  • Public opinion polls conducted before 2012 and 2016 elections;
  • I scraped the 2012 polls from Wikipedia and tracked the 2016 polls through mass media;
  • Election day exit polls;

Load the data:

setwd("D:\\Dropbox\\My Projects\\Courses\\R_LADIES\\ts_models\\repo") # change to your own working directory
parl12 <- read.csv("data/parl12.csv", sep="\t")
parl16 <- read.csv("data/parl16.csv", sep="\t")

Load libraries:

library(ggplot2)
library(reshape2)
library(ggthemes) # Some fancy themes for your charts
library(gridExtra) # Combine your charts
library(grid)
library(zoo)
library(forecast)
library(e1071)

Load libraries (at once):

ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

packages<-c("ggplot2", "reshape2", "ggthemes", "forecast", "gridExtra", "grid", "zoo", "e1071")

ipak(packages)

Load Election results:

results12 <- data.frame(party=c("Dream", "UNM", "CDM", "OTH"), votes12=c(0.5490, 0.4040, 0.0205, 0.0265))

results16 <- data.frame(party=c("Dream", "UNM", "PA", "FD", "SFP", "UDM", "LAB", "OTH"), votes16=c(0.487, 0.271, 0.05, 0.046, 0.035, 0.035, 0.031, 0.045))

Polling average

  • Calculating polling averages is the simplest method of forecasting elections;
  • This method is deterministic in nature, though, and therefore does not account for uncertainty;
    • Which is bad - social world is not deterministic;

Polling average

  • Georgia employs mixed electoral system:
    • 75 MPs elected through party lists;
    • 75 MPs elected in single-member districts (the “Majoritarians”);
  • In this exercise, we only look at proportional voting;

Polling average

  • We use the apply function to calculate column means;
  • Then we attach names to the results and bind everything together;
names(parl12)
savg12 <- apply(parl12[5:8], 2, mean)
names(savg12) <- c("Dream", "UNM", "CDM", "OTH")
total12 <- cbind(results12, savg12)

names(parl16)
savg16 <- apply(parl16[4:(ncol(parl16)-2)], 2, mean)
names(savg16) <- c("Dream", "UNM", "PA", "FD", "SFP", "UDM", "LAB", "OTH")
total16 <- cbind(results16, savg16)

Weighted average

  • A simple average doesn't give the best prediction, as it does not account for whether a poll is new or old;
  • Here we need some “Bayesian” thinking - we have to update our beliefs based on new polls, that is, give more weight to newer polls;
  • Polls are therefore weighted by the inverse of the number of days left until the election; exit polls get weight 1.
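The weight column used below already comes prepared in the CSV files, but it can be reconstructed from the poll dates; a minimal sketch, with made-up poll dates and the real 2012 election day:

```r
# Sketch: weights as the inverse of the number of days left until election day.
# Poll dates here are hypothetical; 2012-10-01 was the actual election day.
election_day <- as.Date("2012-10-01")
poll_dates <- as.Date(c("2012-06-15", "2012-08-20", "2012-10-01"))
days_left <- as.numeric(election_day - poll_dates)
# Exit polls (0 days left) get weight 1; earlier polls get 1/days_left
weight <- ifelse(days_left == 0, 1, 1 / days_left)
round(weight, 4)
# → 0.0093 0.0238 1.0000
```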

Weighted average

  • Note that here we use lapply and weighted.mean;
  • The result is a list, so we use melt from the reshape2 package to turn it into a tidy table; its second column holds the party names.
wavg12 <- lapply(parl12[5:8], weighted.mean, w = parl12$weight)
wavg12 <- melt(wavg12) # two columns: value and L1 (the party name)
names(wavg12) <- c("wavg12", "party") # rename so we can merge by party
total12 <- merge(total12, wavg12, by="party")

wavg16 <- lapply(parl16[4:(ncol(parl16)-2)], weighted.mean, w = parl16$weight)
wavg16 <- melt(wavg16) # two columns: value and L1 (the party name)
names(wavg16) <- c("wavg16", "party")
total16 <- merge(total16, wavg16, by="party")

Visualization

  • No quantitative analysis is complete without visualizing the data;
  • In order to create beautiful graphs, like the ones on the FiveThirtyEight website, I will use the ggplot2 and ggthemes libraries;

Visualization

  • ggplot2 expects datasets in tidy format, that is, “each variable is a column, each observation is a row, and each type of observational unit is a table” (Wickham, 2014);
  • Data with such a structure is easy to manipulate and more efficient from a computational perspective;
  • I use reshape2's melt function to turn total* tables into tidy format.

Tidying messy dataset

# Give both tables identical column names so the melted variable matches across years
names(total12) <- c("party", "Vote Share", "Simple Average", "Weighted Average")
total12 <- melt(total12, id.vars = "party")
total12$elections <- "2012"

names(total16) <- c("party", "Vote Share", "Simple Average", "Weighted Average")
total16 <- melt(total16, id.vars = "party")
total16$elections <- "2016"
total <- rbind(total12, total16)

Visualization

# Filter out the necessary chunk of data and indicate aesthetics:
ggplot(data=total[total$elections=="2012",], aes(party, value, fill = party))+
# Indicate what kind of graph we are making
  geom_col()+
# Limit Y axis from 0 to 1
  ylim(0, 1)+
# Use manual colors for political parties
  scale_fill_manual(values = c("#9a142c", "#195ea2", "grey", "#e4012e"))+
# Create three facets, one for each type of analysis (vote share, weighted average, simple average)
  facet_grid(~variable)+
# Attach labels to X axis
  scale_x_discrete(limits=c("UNM", "Dream", "CDM", "OTH"), labels=c("UNM", "GDC", "CDM", "Others"))+
# Use fancy FiveThirtyEight-like theme from ggthemes library
  theme_fivethirtyeight()+
# Remove legend
  theme(legend.position="none"
        )+
# Attach custom title and labels
    labs(title = "2012",
       x = "Parties",
       y = "%")+
# Add horizontal red line to denote 5% threshold
    geom_hline(yintercept = 0.05, color = "red")

Visualization

ggplot(data=total[total$elections=="2016",], aes(party, value, fill = party))+
  geom_col()+
  ylim(0, 1)+
  facet_grid(~variable)+
  scale_x_discrete(limits=c("Dream", "UNM", "PA", "FD", "SFP", "UDM", "LAB", "OTH"), labels=c("Dream", "UNM", "PA", "FD", "SFP", "UDM", "LAB", "Other"))+
scale_fill_manual(values = c("#195ea2", "grey", "#e4012e", "#003087", "#faa41f", "#e7b031", "#ec1c24", "#33ace2"))+    
  theme_fivethirtyeight()+
  theme(legend.position="none"
        )+
  labs(title = "2016",
       x = "Parties",
       y = "%") +
  geom_hline(yintercept = 0.05, color = "red")

Should we be certain about our predictions?

  • Of course not! All our predictions so far are deterministic!
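One cheap way to attach at least some uncertainty to a single poll is the usual sampling margin of error; a minimal sketch, where the share and the sample size are made-up numbers and simple random sampling is assumed:

```r
# Approximate 95% margin of error for one poll share,
# assuming simple random sampling (numbers are hypothetical).
p <- 0.42  # share for a party in one poll
n <- 1500  # sample size
moe <- 1.96 * sqrt(p * (1 - p) / n)
round(c(lower = p - moe, upper = p + moe), 3)
# → lower 0.395, upper 0.445
```

Real forecast uncertainty is larger than this: polls also carry house effects and other non-sampling errors that the formula ignores.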

Moving averages

  • Calculating moving averages is one of the simplest methods of time-series analysis;
  • In short, when applying moving averages we try to extrapolate future values from previous data points;
  • It's a method for smoothing noisy data points to derive a theoretical line which may forecast future values;

Moving averages

  • We will implement the moving average using the zoo package;
  • In order to do so, we first need to turn the time series into a zoo object and tell R the order of the observations;
  • Then we apply the rollmean function, indicating the width of the rolling window;

Moving averages

parl12$Date <- as.Date(parl12$Date, format="%m/%d/%Y")

parl12.zoo <- zoo(x=parl12$UNM, order.by = parl12$Date)

rollmean(parl12.zoo, 3) # rolling mean over a window of three polls

parl16$Date <- as.Date(parl16$Date, format="%m/%d/%Y")

parl16.zoo <- zoo(x=parl16$UNM, order.by = parl16$Date)

rollmean(parl16.zoo, 3) # rolling mean over a window of three polls

Holt-Winters exponential smoothing

  • It's another method for forecasting time-series data;
  • Similar to moving average, it's another smoothing method;
  • In R, we implement Holt-Winters algorithm using forecast library;
  • Follow this tutorial for a more detailed take on the topic;

Holt-Winters exponential smoothing

unm.hwm <- HoltWinters(parl12$UNM, gamma = FALSE) # gamma=FALSE: no seasonal component
unm.hwf <- forecast(unm.hwm, h = 3) # forecast() replaces the deprecated forecast.HoltWinters()
summary(unm.hwf)

Take-away

  • What's the problem with time-series data on polls?
    • They are pretty scarce;
    • We don't have individual-level data in order to properly model electoral behavior;

Therefore, we need to be careful when interpreting forecasts!

Thank you!

david@sichinava.ge

davidsichinava

@davidsichinava